Business Data Analytics Project
18MCMI05: Delton M Antony, MTech Artificial Intelligence
18MCMI14: Garima Jain, MTech Artificial Intelligence
Churn Prediction Dataset
Source: https://www.kaggle.com/blastchar/telco-customer-churn
Each row represents a customer and each column contains a customer attribute, described in the column metadata. The raw data contains 7043 rows (customers) and 21 columns (features). The “Churn” column is our target.
Classification:
Classification is a supervised learning approach in which the program learns from the labelled data given to it and then uses this learning to classify new observations. Here, in the telecom churn prediction dataset, the problem is to predict whether a customer is going to churn or not.
Loading the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# read into a pandas dataframe
data = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
View the first few records
data.head()
Check the datatypes of the features and other information.
data.info()
Data Preprocessing: We need to correct the datatype of features, check and impute for null values, perform data partitioning, handle categorical data, perform feature selection and feature scaling
Note that TotalCharges is a non-null object. We need to convert it to a numeric datatype.
data.TotalCharges = pd.to_numeric(data.TotalCharges, errors='coerce')
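As a quick illustration of errors='coerce' (on a hypothetical toy series, not the dataset itself), values that cannot be parsed as numbers become NaN instead of raising an error:

```python
import pandas as pd

# hypothetical toy series: numeric strings with one blank, as in TotalCharges
s = pd.Series(['10.5', ' ', '20'])
converted = pd.to_numeric(s, errors='coerce')
print(converted)  # the blank entry becomes NaN; the rest become floats
```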
Now let us check for null values.
nulls = data.isnull().sum()
nulls[nulls > 0]
There are eleven null values in 'TotalCharges'. Let us impute these null values with zeroes.
data.fillna(0,inplace=True)
The 'no phone service' value in the above MultipleLines independent variable can be treated as 'No'
data['MultipleLines'].replace('No phone service','No',inplace=True)
Let y be the vector of dependent variable values ("Churn") and let X be the matrix holding all the independent variables. The customerID column is a mere identifier with no predictive value, so we drop it along with the target.
y = data['Churn'].map({'Yes':1,'No':0})
X = data.drop(labels=['Churn','customerID'],axis=1).copy()
tenure, MonthlyCharges and TotalCharges are numerical; every other column is categorical. Hence, we need to convert the categorical columns into binary indicators. First, find the list of categorical columns for encoding.
cat_cols = []
for column in X.columns:
    if column not in ['tenure','MonthlyCharges','TotalCharges']:
        cat_cols.append(column)
cat_cols
The columns above are categorical. You can convert them into binary indicators by either of two means - OneHotEncoder or pandas.get_dummies(). Here we encode the categorical columns with the pandas get_dummies() method.
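A minimal sketch of what get_dummies() produces, using a hypothetical toy frame with a single contract column:

```python
import pandas as pd

toy = pd.DataFrame({'Contract': ['Month-to-month', 'One year', 'Two year']})
encoded = pd.get_dummies(toy, columns=['Contract'])
print(encoded.columns.tolist())
# one binary indicator column per category:
# ['Contract_Month-to-month', 'Contract_One year', 'Contract_Two year']
```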
X = pd.get_dummies(X, columns=cat_cols)
X.info()
The above are the columns ie independent variables we got after handling categorical variables.
Now we have to perform feature selection on the data. Of the many ways to perform feature selection, I am using backward elimination with OLS, i.e. Ordinary Least Squares. I am performing it manually so that I can see the p-values after removing each feature, using the summary() method.
import statsmodels.api as sm
Append a column of ones as the 0th column to serve as the intercept term, since OLS does not add a constant automatically.
X = np.append(arr = np.ones((X.shape[0], 1)).astype(int), values = X, axis=1)
Now the preparation for backward elimination is complete. The procedure is to fit an OLS model and check the summary for p-values. At first, all the features are included; then features are dropped one by one, each time removing the feature with the highest p-value.
X_opt = X
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p value is for col 42
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 43, 44, 45]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p value is for col 14, so delete that
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p value is for col 20, so delete that
X_opt = X_opt[:,[0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 32, delete it
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 28, delete it
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 35, delete it
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 36, 37, 38, 39, 40]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 37, delete it
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 38, 39]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 38, delete it
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 28, delete it
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 19, delete it
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 5, delete it
X_opt = X_opt[:,[0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 32, delete it. Highest p was .452
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 11, delete it. Highest p was .178
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 10, delete it. Highest p was .083
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 29, delete it. Highest p was .019
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 30, 31]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 9, delete it. Highest p was .013
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 8, delete it. Highest p was .075
X_opt = X_opt[:,[0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit() # OrdinaryLeastSquares
regressor_OLS.summary() # Highest p is for col 8, delete it. Highest p was .01
We filtered out the features with large p values until the maximum remaining p value was 0.01. Now we are left with 28 features. The only thing left is to overwrite our input matrix with this optimal matrix.
X = X_opt
Feature Scaling: This is a data preprocessing step applied to the independent variables. It normalises the data within a particular range and can also speed up computation in some algorithms. Here we are standardizing the values.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)
X
It is evident that the data got scaled from the above representation of X.
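The standardization applied by StandardScaler can be checked on a hypothetical toy column: each value becomes (x - mean) / std, so the scaled column has mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

col = np.array([[1.0], [2.0], [3.0]])
scaled = StandardScaler().fit_transform(col)  # (x - mean) / std for each value
print(scaled.ravel())  # [-1.2247...  0.  1.2247...]
```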
Data partitioning: Now, split the data into a training set and a test set in the ratio 7:3. The result is stored as training data (X_train, y_train) and test data (X_test, y_test).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
With this, the Data Preprocessing stage is now complete.
Modeling:
This is a classification problem: we need to classify each record into one of two classes, which is called binary classification. For binary classification, we can apply Logistic Regression, Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, Multi-Layer Perceptron, and Light and Extreme Gradient Boosting methods to build models that classify the records into Churn or Not Churn. The hyperparameters applied to the models were selected using a grid search performed separately (not shown here). For some models, the default parameters give the best results.
Logistic Regression: With logistic regression, we are trying to fit the data to the logistic function. It is a discriminative classifier.
from sklearn.linear_model import LogisticRegression
logRegClassifier = LogisticRegression()
logRegClassifier.fit(X_train, y_train)
y_predLogReg = logRegClassifier.predict(X_test)
Now the vector y_predLogReg contains the predicted results. The classification results can be evaluated in two ways - using a confusion matrix or using an ROC curve.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc
logRegConfMatrix = confusion_matrix(y_test, y_predLogReg)
logRegConfMatrix
Here, accuracy = 80.075 %
However, accuracy alone is not good enough to judge a model. What if the model simply predicted every customer as not churning, i.e. all predicted values were 0s? The accuracy score would still be high. Hence we need measures that make the evaluation of a model more reliable. That is where AUC comes into play: the Area Under the ROC Curve gives a better perspective on the dependability of a model.
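To see why, consider a hypothetical sample with roughly this dataset's class balance. A model that never predicts churn still scores well above 70% accuracy while catching zero churners:

```python
import numpy as np

y_true = np.array([0] * 73 + [1] * 27)  # roughly the churn ratio in this dataset
y_all_zero = np.zeros_like(y_true)      # "predict no churn for everyone"
accuracy = (y_all_zero == y_true).mean()
print(accuracy)  # 0.73, even though not a single churner was identified
```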
fpr, tpr, thresholds = roc_curve(y_test, y_predLogReg) # roc_curve expects (y_true, y_score)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='Logistic Regression (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
AUC for the logistic regression model is 0.74 and the accuracy is 80.075%.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predLogReg))
Precision: What proportion of positive identifications was actually correct? Precision is defined as TP/(TP+FP).
Recall: What proportion of actual positives was identified correctly? Recall is defined as TP/(TP+FN). Recall is also called sensitivity.
f1-score: The f1-score is the harmonic mean of precision and recall. The score for each class tells you how well the classifier separates data points of that class from all other classes. f1-score = 2*(Precision*Recall)/(Precision+Recall)
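These definitions can be checked directly against hypothetical confusion-matrix counts (not this model's actual matrix):

```python
tn, fp, fn, tp = 900, 100, 150, 350  # hypothetical confusion-matrix counts

precision = tp / (tp + fp)                          # 350 / 450
recall = tp / (tp + fn)                             # 350 / 500
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.778 0.7 0.737
```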
Naive Bayes Classification: Naive Bayes is a generative probabilistic classifier. It works on the principle of Bayes' theorem for conditional probability, under the assumption that the occurrence of one feature has nothing to do with the occurrence of another in a record. This assumption is the reason it is called naive.
from sklearn.naive_bayes import GaussianNB
nbClassifier = GaussianNB()
nbClassifier.fit(X_train, y_train)
y_predNB = nbClassifier.predict(X_test)
Now the y_predNB vector contains the values predicted by the naive Bayes classifier. Let's evaluate the model using the confusion matrix and ROC curve.
nbConfMatrix = confusion_matrix(y_test, y_predNB)
nbConfMatrix
Here, accuracy is 67.43 %
Plotting the roc curve and finding the area under roc
fpr, tpr, thresholds = roc_curve(y_test, y_predNB)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='Naive Bayes (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_predNB))
KNN Classification: KNN stands for k-nearest neighbours. KNN is a lazy, non-parametric algorithm. Let us consider 20 neighbours and use Euclidean distance (i.e. Minkowski distance with p = 2).
from sklearn.neighbors import KNeighborsClassifier
KNNClassifier = KNeighborsClassifier(n_neighbors=20, p=2, metric='minkowski')
KNNClassifier.fit(X_train, y_train)
y_predKNN = KNNClassifier.predict(X_test)
Confusion matrix is below.
knnConfMatrix = confusion_matrix(y_test, y_predKNN)
knnConfMatrix
Here, accuracy = 78.845%
Let's plot the roc and find the auc
fpr, tpr, thresholds = roc_curve(y_test, y_predKNN)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='KNN (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_predKNN))
Support Vector Machine Classification: Unlike other models, SVC does not use the entire dataset to define its decision boundary; it uses the tipping points - also known as support vectors. These are the points that define the boundary between the two classes. This is a rare case of reduction with respect to the records as opposed to the features.
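A small sketch on hypothetical blob data showing that only a fraction of the training points end up as support vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_toy, y_toy = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel='linear').fit(X_toy, y_toy)
print(clf.n_support_)  # support vectors per class - far fewer than the 100 records
```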
from sklearn.svm import SVC
svcClassifier = SVC(kernel='linear', random_state=0, cache_size=7000)
svcClassifier.fit(X_train, y_train)
y_predSVM = svcClassifier.predict(X_test)
The predictions made by the SVM are stored in y_predSVM and can be evaluated against the test set. The confusion matrix is as follows.
svcConfMatrix = confusion_matrix(y_test, y_predSVM)
svcConfMatrix
Plotting the roc and finding the auc
fpr, tpr, thresholds = roc_curve(y_test, y_predSVM)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='SVM (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_predSVM))
Although this works, it takes a relatively long time to converge; SVM is computation-intensive. Hence, we need a way to reduce the number of input dimensions so that the SVM fitting converges faster.
Principal Component Analysis: Principal Component Analysis, commonly abbreviated as PCA, is a method to reduce the input dimensions without performing feature selection. PCA performs a linear transformation of the input data; we then generally keep only the first few resultant principal components, as they cover most of the variance in the data. First let us test this on logistic regression, and then apply it to the SVM to see whether the SVM takes less time.
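The claim that the first few components cover most of the variance can be inspected via explained_variance_ratio_. A sketch on hypothetical data that has rank at most 5 but is embedded in 10 features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 10))  # 10 features, rank <= 5
pca = PCA().fit(data)
print(pca.explained_variance_ratio_.cumsum().round(3))
# the cumulative ratio reaches ~1.0 by the fifth component
```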
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
#pca = PCA(.80) # would instead keep enough components to explain 80% of the variance
pca.fit(X_train)
X_train_PCA = pca.transform(X_train)
X_test_PCA = pca.transform(X_test)
X_train_PCA
This now only has three columns. This is the result when we take only the first three principal components.
Let us perform logistic regression using the principal components we obtained above.
pcaLogRegClassifier = LogisticRegression()
pcaLogRegClassifier.fit(X_train_PCA, y_train)
y_pred_logReg_PCA = pcaLogRegClassifier.predict(X_test_PCA)
Let's see the confusion matrix of this.
pcaLogRegConfMatrix = confusion_matrix(y_test, y_pred_logReg_PCA)
pcaLogRegConfMatrix
Here, the accuracy is 78.466%
Plotting the roc and finding auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred_logReg_PCA)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='PCA Logistic Regression (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_pred_logReg_PCA))
Now, let us use these same principal components to fit an SVC model
from sklearn.svm import SVC
svcPCAclassifier = SVC(kernel='linear', random_state=0)
svcPCAclassifier.fit(X_train_PCA, y_train)
y_pred_svc_PCA = svcPCAclassifier.predict(X_test_PCA)
As expected, the SVM took significantly less time to converge. Let us see the confusion matrix.
pcaSVCConfMatrix = confusion_matrix(y_test, y_pred_svc_PCA)
pcaSVCConfMatrix
Here, the accuracy is 78.655%
Plotting the roc and finding the auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred_svc_PCA)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='SVM PCA (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_pred_svc_PCA))
Kernel SVM Classification: With a non-linear kernel such as RBF, the SVM can learn a non-linear decision boundary by implicitly mapping the data into a higher-dimensional space.
from sklearn.svm import SVC
kernelSVCclassifier = SVC(kernel="rbf", probability=True, random_state=0) # degree applies only to the 'poly' kernel, so it is omitted
kernelSVCclassifier.fit(X_train, y_train)
y_predKernelSVM = kernelSVCclassifier.predict(X_test)
Let's see the confusion matrix
kernelSVMconfusionMatrix = confusion_matrix(y_test, y_predKernelSVM)
kernelSVMconfusionMatrix
Here, the accuracy is 79.791%
Plotting the roc and finding the auc
fpr, tpr, thresholds = roc_curve(y_test, y_predKernelSVM)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='Kernel SVM (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_predKernelSVM))
Decision Tree Classification: A decision tree is the go-to method if you want clear insight into how the classifier predicts and under what criteria, since it yields explicit rules. Hence, it is not considered a black box. We are using information gain as the splitting criterion, i.e. criterion='entropy'.
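The entropy criterion measures the impurity of a node. A minimal sketch of the binary entropy formula the splits are scored with:

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy (in bits) of a node where fraction p is the positive class."""
    if p in (0.0, 1.0):
        return 0.0  # a pure node carries no uncertainty
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(binary_entropy(0.5))  # 1.0 - a 50/50 node is maximally impure
print(binary_entropy(1.0))  # 0.0 - a pure node
```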
from sklearn.tree import DecisionTreeClassifier
decisionTreeClassifierEntropy = DecisionTreeClassifier(criterion="entropy", max_features=10)
decisionTreeClassifierEntropy.fit(X_train, y_train)
y_predDecTreeEntropy = decisionTreeClassifierEntropy.predict(X_test)
Let's see the confusion matrix
decTreeEntropyConfusionMatrix = confusion_matrix(y_test, y_predDecTreeEntropy)
decTreeEntropyConfusionMatrix
Here, the accuracy is 74.443%
Plotting the roc and finding the auc
fpr, tpr, thresholds = roc_curve(y_test, y_predDecTreeEntropy)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='Decision Tree (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_predDecTreeEntropy))
It is possible to visualize the generated decision tree, but it is often too large to display. We also need pydotplus installed in our development environment for the following snippet to work.
from io import StringIO # sklearn.externals.six has been removed from recent scikit-learn versions
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dotData = StringIO()
export_graphviz(decisionTreeClassifierEntropy, out_file=dotData, filled=True, rounded=True, special_characters=True)
decTreeEntropyGraph = pydotplus.graph_from_dot_data(dotData.getvalue())
Image(decTreeEntropyGraph.create_png())
Fortunately, in this case the graph can be viewed. It is also worth noting from the graph that feature selection is implicit in a decision tree; furthermore, the hyperparameter max_features can be used to restrict the number of features considered at each split.
Random Forest Classification:
Random forest belongs to a category called ensemble algorithms. Here, instead of one single decision tree, an army of decision trees is used, and their individual results are combined to create the model. The hyperparameter n_estimators sets the number of individual decision trees used in the forest. We are using information gain as the criterion, as before.
from sklearn.ensemble import RandomForestClassifier
randomForestClassifier = RandomForestClassifier(n_estimators=10, criterion="entropy", max_features=15, random_state=0)
randomForestClassifier.fit(X_train, y_train)
y_predRandomForest = randomForestClassifier.predict(X_test)
Confusion matrix for the model is computed below
randomForestConfMatrix = confusion_matrix(y_test, y_predRandomForest)
randomForestConfMatrix
Here, accuracy is 78.75%
Plotting the roc and finding the auc
fpr, tpr, thresholds = roc_curve(y_test, y_predRandomForest)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='Random Forest (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
As you can see, random forest is giving a much higher auc score compared to decision tree. Ten trees are better than one.
print(classification_report(y_test, y_predRandomForest))
Light Gradient Boosting Method: LightGBM is a gradient boosting framework that uses tree-based learning algorithms. LightGBM grows trees vertically (leaf-wise) while other algorithms grow trees horizontally (level-wise): it chooses the leaf with the maximum delta loss to grow. When growing the same number of leaves, a leaf-wise algorithm can reduce more loss than a level-wise one.
from lightgbm import LGBMClassifier
lgbmClassifier = LGBMClassifier(learning_rate=0.1, objective="binary", random_state=0, max_depth=12)
lgbmClassifier.fit(X_train, y_train)
y_predLGBM = lgbmClassifier.predict(X_test)
Confusion matrix for lgbm classifier
lgbmConfMatrix = confusion_matrix(y_test, y_predLGBM)
lgbmConfMatrix
Here, the accuracy is 79.981%
Plotting the roc and finding the auc
fpr, tpr, thresholds = roc_curve(y_test, y_predLGBM)
roc_auc = auc(fpr, tpr)
plt.figure(figsize = (10,6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='Light Gradient Boosting (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_predLGBM))
Extreme Gradient Boosting: Gradient boosting is a machine learning technique for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function.
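The stage-wise fitting can be observed with staged_predict, which yields the ensemble's prediction after each boosting round. A sketch on hypothetical synthetic data, using scikit-learn's GradientBoostingClassifier for illustration since it exposes this method:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=0)
gb = GradientBoostingClassifier(n_estimators=5, random_state=0).fit(X_toy, y_toy)

# training accuracy after each of the five boosting stages
stage_acc = [(pred == y_toy).mean() for pred in gb.staged_predict(X_toy)]
print(stage_acc)  # typically improves as stages are added
```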
from xgboost import XGBClassifier
xgbClassifier = XGBClassifier(learning_rate=0.1, max_depth=4)
xgbClassifier.fit(X_train, y_train)
y_predXGB = xgbClassifier.predict(X_test)
Confusion Matrix for xgboost
xgbConfMatrix = confusion_matrix(y_test, y_predXGB)
xgbConfMatrix
Here, accuracy is 80.785%
fpr, tpr, thresholds = roc_curve(y_test, y_predXGB)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='XG Boost (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_predXGB))
Multi-Layer Perceptron: A multilayer perceptron is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. An MLP is trained with a supervised learning technique called backpropagation, and it can handle data that is not linearly separable.
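A minimal numpy sketch of a single forward pass through such a network (hypothetical random weights; backpropagation would adjust them during training):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=3)                          # one record with 3 features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer: 1 neuron

hidden = np.maximum(0.0, W1 @ x + b1)               # ReLU activation
output = 1.0 / (1.0 + np.exp(-(W2 @ hidden + b2)))  # sigmoid -> churn probability
print(output)  # a probability strictly between 0 and 1
```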
from sklearn.neural_network import MLPClassifier
nnClassifier = MLPClassifier(activation="relu", solver="sgd", max_iter=85, random_state=0, verbose=True)
nnClassifier.fit(X_train, y_train)
y_predMLP = nnClassifier.predict(X_test)
Confusion Matrix
mlpConfMatrix = confusion_matrix(y_test, y_predMLP)
mlpConfMatrix
Here, accuracy is 79.933%
fpr, tpr, thresholds = roc_curve(y_test, y_predMLP)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(10,6))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='MLP Classifier (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
print(classification_report(y_test, y_predMLP))
Now we are done with all the models. Let us compare the models we generated by plotting their ROC curves on the same graph.
fpr, tpr, thresholds = roc_curve(y_test, y_predLogReg) # roc_curve expects (y_true, y_score)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(15, 10))
plt.plot(fpr, tpr, color='darkorange', lw=1, label='Logistic Regression (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--') # chance line
fpr, tpr, thresholds = roc_curve(y_test, y_predNB)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='blue', lw=1, label='Naive Bayes (area = %0.2f)' % roc_auc)
fpr, tpr, thresholds = roc_curve(y_test, y_predKNN)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='green', lw=1, label='KNN (area = %0.2f)' % roc_auc)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_logReg_PCA)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='cyan', lw=1, label='LogReg PCA (area = %0.2f)' % roc_auc)
fpr, tpr, thresholds = roc_curve(y_test, y_predSVM)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='teal', lw=1, label='SVM (area = %0.2f)' % roc_auc)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_svc_PCA)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='magenta', lw=1, label='SVC PCA (area = %0.2f)' % roc_auc)
fpr, tpr, thresholds = roc_curve(y_test, y_predKernelSVM)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='red', lw=1, label='Kernel SVM (area = %0.2f)' % roc_auc)
fpr, tpr, thresholds = roc_curve(y_test, y_predDecTreeEntropy)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='yellow', lw=1, label='Decision Tree (area = %0.2f)' % roc_auc)
fpr, tpr, thresholds = roc_curve(y_test, y_predRandomForest)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='black', lw=1, label='Random Forest (area = %0.2f)' % roc_auc)
fpr, tpr, thresholds = roc_curve(y_test, y_predLGBM)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='violet', lw=1, label='Light Gradient Boosting (area = %0.2f)' % roc_auc)
fpr, tpr, thresholds = roc_curve(y_test, y_predXGB)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkgoldenrod', lw=1, label='XG Boosting (area = %0.2f)' % roc_auc)
fpr, tpr, thresholds = roc_curve(y_test, y_predMLP)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, color='chocolate', lw=1, label='MLP (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
We can see that, as expected, XGBoost has the highest score. It is closely followed by MLP, LightGBM, kernel SVM, SVM and of course logistic regression. The worst scores are for the decision tree and naive Bayes.